RESEARCH PAPER 



Mobile Genetic Elements 2:1, 19-25; January/February 2012; © 2012 Landes Bioscience 

Do human transposable element small RNAs 
serve primarily as genome defenders or genome 
regulators? 

Kevin J. Lee, Andrew B. Conley, Victoria V. Lunyak and I. King Jordan 

School of Biology; Georgia Institute of Technology; Atlanta, GA USA; Buck Institute for Age Research; Novato, CA USA; PanAmerican Bioinformatics Institute; Santa Marta, Magdalena, Colombia 

Keywords: gene expression, genome regulation, RNA processing, small RNA, RNA interference 

Abbreviations: mRNA, messenger RNA; miRNA, microRNA; piRNA, PIWI interacting RNA; sRNA, small RNA; 

TE, transposable element; UTR, untranslated region 



It is currently thought that small RNA (sRNA) based repression mechanisms are primarily employed to mitigate the 
mutagenic threat posed by the activity of transposable elements (TEs). This can be achieved by the sRNA guided 
processing of TE transcripts via Dicer-dependent (e.g., siRNA) or Dicer-independent (e.g., piRNA) mechanisms. For 
example, potentially active human L1 elements are silenced by mRNA cleavage induced by element encoded siRNAs, 
leading to a negative correlation between element mRNA and siRNA levels. On the other hand, there is emerging 
evidence that TE derived sRNAs can also be used to regulate the host genome. Here, we evaluated these two hypotheses 
for human TEs by comparing the levels of TE derived mRNA and TE sRNA across six tissues. The genome defense 
hypothesis predicts a negative correlation between TE mRNA and TE sRNA levels, whereas the genome regulatory 
hypothesis predicts a positive correlation. On average, TE mRNA and TE sRNA levels are positively correlated across 
human tissues. These correlations are higher than seen for human genes or for randomly permuted control data sets. 
Overall, Alu subfamilies show the highest positive correlations of element mRNA and sRNA levels across tissues, although 
a few of the youngest, and potentially most active, Alu subfamilies do show negative correlations. Thus, Alu derived 
sRNAs may be related to both genome regulation and genome defense. These results are inconsistent with a simple 
model whereby TE derived sRNAs reduce levels of standing TE mRNA via transcript cleavage, and suggest that human 
cells efficiently process TE transcripts into sRNA based on the available message levels. This may point to a widespread 
role for processed TE transcripts in genome regulation or to alternative roles of TE-to-sRNA processing including the 
mitigation of TE transcript cytotoxicity. 




Introduction 

Eukaryotic genomes harbor numerous transposable element (TE) 
sequences that are capable of moving from one location in the 
genome to another. This transpositional activity entails the 
genomic insertion of relatively large sequences and often leads to 
highly deleterious mutations. TE insertions can cause protein 
coding sequence mutations or premature termination of trans- 
cription in gene regions, can disrupt normal patterns of gene 
expression by targeting regulatory sequences and can lead to 
chromosomal breakage and re-arrangements. Thus, TEs can be 
extremely mutagenic, and so genomes must have some way to 
control their activity. 

A variety of transposition repression mechanisms have evolved 
to mitigate the threat that TEs pose to genome integrity. These 
include DNA methylation, repressive histone modifications, 
the activity of cytosine deaminases and DNA repair proteins 
and even the physical elimination of TE sequences from the 
genome. In addition, results from recent studies are taken to 
point to a number of small RNA (sRNA) based mechanisms 
that may be employed for the repression of TEs. sRNAs 
refer to a number of different short RNA species processed from 
longer transcripts such as Dicer-dependent short interfering 
RNAs (siRNAs) or Dicer-independent PIWI-interacting RNAs 
(piRNAs). For example, the RNA interference (RNAi) pathway 
in Caenorhabditis elegans uses TE-derived sRNAs generated from 
double-stranded RNA (dsRNA) by Dicer to repress the 
transposition of DNA-type elements. In Drosophila, piRNAs 
processed from TEs via a distinct "ping-pong" amplification 
method are used to repress transposition in the germline thereby 
blocking the inheritance of TE-induced mutations and 
safeguarding development. TE-derived sRNAs in mouse are 
used to repress the transcription of retrotransposons in oocytes. 

Close to 50% of the human genome sequence is derived from 
TEs. While the vast majority of these elements are no longer 
capable of transposition, there remain a handful of active 
elements, LINE-1 (L1) and Alu sequences for the most part, that 
pose a substantial mutagenic threat. Work done on L1s provides 
the best characterized example of sRNA regulation for a human 
TE. Full-length, potentially active L1 elements encode an 
antisense promoter in their 5' UTR. Bi-directional 
transcriptional activity from both the canonical L1 sense promoter and 
the anti-sense promoter leads to the production of dsRNA, which 
is processed into L1-specific sRNAs. These L1 sRNAs were 
shown to repress transposition by degrading full-length L1 mRNA 
transcripts. Thus, for human L1s an inverse correlation has been 
observed between the levels of L1 mRNA and element sRNA. 

In light of this work on the sRNA regulation of human L1s, we 
hypothesized that if the predominant role of TE-derived sRNAs 
is to repress transposition by means of transcript cleavage, as the 
levels of TE-specific sRNA go up, there should be a concomitant 
decrease in TE mRNA levels genome-wide. If this is the case, we 
expect to observe a negative correlation between TE mRNA and 
TE sRNA levels. On the other hand, if TE generated sRNAs 
are primarily being utilized by the genomes in which they reside 
to facilitate the regulation of host genes, one may expect to see 
a positive correlation between levels of TE-derived mRNA and 
sRNA. This would suggest that TE-derived transcripts are 
efficiently processed by the host cellular machinery, based on 
available levels of RNA messages, in a way that does not reduce 
the overall efficacy of TE expression. Under this scenario, TEs 
would be dynamically regulated to express transcripts that are 
destined to be processed and function in sRNA based cellular 
regulatory pathways as opposed to simply serving as transposition 
intermediates. 

Consistent with a potential role for TE transcripts in genome 
regulation, it has recently been shown that human TEs initiate 
transcription on a massive scale and are also dynamically regu- 
lated among different cell types; this includes the expression of 
numerous relatively ancient TEs that are no longer capable of 
transposing. Furthermore, there are several recent examples 
illustrating that TE-derived sRNAs can in fact regulate host genes. 
In Drosophila melanogaster, TE-derived piRNAs play a critical 
role in embryonic patterning by targeting a specific host gene 
message. piRNAs derived from the roo and 412 retrotrans- 
posons facilitate cleavage of the nos mRNA via interactions with 
its 3' UTR thereby establishing a posterior-to-anterior gradient 
that is critical for proper head and thorax segmentation. In the 
human genome, TE-derived miRNAs have been shown to play 
diverse roles in cancer by regulating both tumor suppressor genes and 
oncogenes. 

In an attempt to distinguish between these two roles for TE- 
derived sRNAs in the human genome, namely whether TE 
sRNAs serve primarily as genome defenders or as genome 
regulators, we explored the relationship between levels of TE 
mRNA and TE sRNA across six tissues. We found that levels of 



TE-derived mRNA and sRNA are positively correlated across 
different tissues, with gene-rich Alu elements showing the 
strongest correlations. Despite previous work showing an inverse 
relationship between L1 element expression and the generation 
of sRNAs, L1 mRNA levels were also positively correlated 
with levels of sRNA. These data are not consistent with the 
widespread cleavage of TE mRNA by TE sRNA, and raise the 
possibility that numerous TE-derived transcripts are processed to 
yield sRNAs that function to regulate the host genome. 

Results 

Mapping of human mRNA and sRNA sequence data. Levels 
of mRNA and sRNA were compared across human tissues for 
individual genes and TE subfamilies. To do this, we used publicly 
available paired sets of mRNA and sRNA data generated with 
high-throughput sequencing techniques from six human tissues: 
brain, heart, kidney, liver, lung and skeletal muscle (Supplementary 
Table S1). Sequence tags were mapped to the human 
genome reference sequence and co-located with genes and TEs 
as described in the Materials and Methods section. A recently 
developed algorithm for mapping ambiguous tags was used to 
ensure maximal coverage of repetitive TE sequences for the short 
sequence tags used. This algorithm ensures that the best single 
genomic location for each multi-mapping tag is chosen, thus 
ensuring deeper coverage of TE sequences than would be achieved 
if multi-mapping tags were discarded. In addition, a series of 
quality controls designed for high-throughput sequence data were 
implemented to ensure the reliability of the sequences used 
(Figs. S1-3). 

Results of the tag-to-genome mapping for the six human 
tissues analyzed here are shown in Table 1. There were ~26-134 
million reads for the mRNA libraries and ~3-7 million 
reads for the sRNA libraries. After processing reads to eliminate 
adaptor sequences, sRNA sequences mapped to the human 
genome with extremely high fidelity. The majority of sRNA reads 
mapped to known miRNA loci, and ~1-2% mapped to TE 
sequences. mRNA reads mapped to the genome with lower 
fidelity, but a greater percentage mapped to TEs. The vast 
majority (90%) of sRNA sequence tags analyzed here were 19- 
24 nt in length suggesting that they are miRNAs or endogenous 
siRNAs, as opposed to longer piRNAs, as can be expected since 
they were isolated from somatic tissue (Fig. S4). In mammalian 
genomes, small RNA based regulation of TEs is primarily attributed 
to siRNAs as opposed to piRNAs, which appear to function 
in TE control exclusively in the male germline. 
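
As a quick consistency check on Table 1, the percentage columns can be recomputed from the read counts. The sketch below is illustrative only: the helper name `pct` is hypothetical, and it assumes (inferred from the table values, not from a stated definition) that the mapped percentage is taken relative to clipped reads for the sRNA libraries and relative to raw reads for the mRNA libraries, where no clipping is reported.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Brain mRNA library (counts from Table 1); no clipping step is reported,
# so the raw read count is used as the denominator (an assumption).
brain_mrna_mapped = pct(28_389_338, 34_493_914)  # mapped to hg18 / reads per tissue
brain_mrna_te = pct(1_001_006, 28_389_338)       # TE reads / mapped reads

# Brain sRNA library; mapped percentage assumed relative to reads after clipping.
brain_srna_mapped = pct(2_939_957, 2_977_817)
brain_srna_te = pct(33_102, 2_939_957)

print(brain_mrna_mapped, brain_mrna_te, brain_srna_mapped, brain_srna_te)
```

Under these assumptions the recomputed values agree with the corresponding entries in Table 1 for the brain libraries.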

Correlation of mRNA and sRNA levels for genes and TEs. 
For individual genes and individual TE subfamilies, mRNA vs. 
sRNA levels were regressed and the resulting correlation coefficients 
and slopes were determined (Fig. 1B). Regressing mRNA 
and sRNA levels across tissues in this way controls for any 
differences in the library preparations used prior to high- 
throughput sequencing since relative levels of expression are 
compared. The distributions of the correlation coefficients and 
slopes were then evaluated to determine the overall relationships 
between mRNA and sRNA levels across tissues for genes and TEs 
(Fig. 1B). In particular, we sought to evaluate whether there was 
an overall negative or positive relationship between mRNA and 
sRNA levels for TE subfamilies in order to distinguish between 
the genome defense vs. genome regulator hypotheses for the 
primary role of human TE sRNAs. 

Table 1. Results of the tag-to-genome mapping for mRNA and sRNA sequence libraries for six human tissues 

Tissue   Reads per     Reads after   Reads that    % reads   Reads that    % of mapping   Reads that     % of mapping
         tissue        clipping      map to hg18   mapped    map to TEs    reads that     map to genes   reads that
                                                                           map to TEs                    map to genes
mRNA
brain    34,493,914    n/a           28,389,338    82.3      1,001,006     3.5            24,194,582     85.2
heart    40,338,602    n/a           32,751,816    81.2      571,069       1.7            26,665,851     81.4
kidney   83,696,940    n/a           42,051,713    50.2      3,828,411     9.1            33,587,016     79.9
liver    125,090,140   n/a           73,281,292    58.6      6,769,796     9.2            64,212,056     87.6
lung     25,862,057    n/a           19,808,655    76.6      3,138,208     15.8           16,434,340     83.0
muscle   45,280,908    n/a           36,984,450    81.7      919,399       2.5            32,413,952     87.6
sRNA
brain    5,021,339     2,977,817     2,939,957     98.7      33,102        1.1            2,452,355      83.4
heart    5,901,910     4,937,144     4,921,992     99.7      42,284        0.9            4,701,738      95.5
kidney   2,869,903     2,135,001     2,108,413     98.8      23,959        1.1            1,720,229      81.6
liver    6,312,578     3,448,077     3,422,122     99.2      74,695        2.2            860,191        25.1
lung     7,294,106     4,808,564     4,709,583     97.9      62,764        1.3            3,652,715      77.6
muscle   3,793,410     3,537,750     3,532,680     99.9      38,019        1.1            3,458,249      97.9

The distribution of correlation coefficients for 760 human TE 
subfamilies is highly skewed toward the positive end with the 
peak value closest to a perfect correlation of 1 (Fig. 2A). The 
distribution is substantially different from a control distribution 
generated by randomly shuffling mRNA and sRNA vectors for 
TE subfamilies, which is far more bell shaped with a peak just 
below 0 (Fig. 2A). The distribution of correlation coefficients for 
genes is also skewed toward the positive end of the scale but 
the effect is far less pronounced than seen for TEs (Fig. 2B). TE 
subfamilies show a median mRNA vs. sRNA correlation 
coefficient of 0.62, which is significantly greater than seen for human 
genes or for the random control (TEs vs. genes W = 2.6 x 10^, 
p < 10-^'; TEs vs. control W = 4.7 x 10^, p < 10"'"). In other 
words, human TE mRNA and sRNA levels show a more 
consistently positive relationship than seen for genes or than can be 
expected by chance given the underlying data values being 
analyzed. 

A similar set of patterns is observed when the distributions 
of the slopes of the linear regression lines are considered 
(Fig. S5). Although the shapes of the observed vs. random 
control distributions are more similar, the observed TE slope 
distribution is shifted to the right indicating that mRNA vs. 
sRNA slopes are greater than would be expected by chance 
alone. The median TE slope value is also significantly higher 
than seen for genes or for the random control (TEs vs. genes 
W = 3.3 x 10^, p = 9.1 x 10"^; TEs vs. control W = 4.5 x 10^, 
p < 10~^^). Thus for human TEs, as mRNA levels increase, 
sRNA levels increase more precipitously than seen for human 
genes. 

We also compared the correlation coefficient and slope 
distributions for the most abundant individual TE families or classes: 
LTR elements, DNA-type elements (i.e. cut-and-paste 
transposons), L1 and Alu. LTR, DNA and L1 groups all show similar 
median positive correlation coefficient values, whereas Alu has 
a significantly higher median value than the rest (Fig. 3A; Alu 
vs. LTR W = 8808, p = 0.01). The pattern seen for the 
comparison of slopes is similar with Alus having an even more 
pronounced difference from the other TE families (Fig. 3B; Alu 
vs. L1 W = 3369, p = 6.7 x 10"''). 
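
The per-subfamily comparison used in this section amounts to an ordinary least squares fit across the six tissues, from which a correlation coefficient and a slope are read off. The following is a minimal sketch in plain Python, not the code used in the study: the function name `regress` and the RPM values are made up for illustration, and treating mRNA as the predictor and sRNA as the response is an assumption based on the statement that sRNA levels increase as mRNA levels increase.

```python
import math

def regress(x, y):
    """Least squares fit of y on x; returns (Pearson r, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    slope = sxy / sxx                # change in y per unit change in x
    return r, slope

# Hypothetical RPM values for one TE subfamily in
# brain, heart, kidney, liver, lung and skeletal muscle.
mrna_rpm = [120.0, 80.0, 200.0, 260.0, 310.0, 90.0]
srna_rpm = [1.1, 0.9, 1.8, 2.2, 2.9, 1.0]

r, slope = regress(mrna_rpm, srna_rpm)
print(f"r = {r:.2f}, slope = {slope:.4f}")
```

Under the genome defense hypothesis one would expect r and the slope to be negative for most subfamilies; under the genome regulation hypothesis, positive.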

Discussion 

Genome defense vs. genome regulation. sRNA regulatory 
pathways are thought to be critical for the control of TEs, 
and accordingly TE-derived sRNAs have mainly been considered 
in light of this paradigm. In this report, we evaluated the 
relationship between levels of human TE mRNA and TE sRNA 
in an attempt to discriminate between this classic view on the 
role of TE sRNAs and the alternative possibility that TE sRNAs 
play functional roles for the host, i.e. the genome defense vs. 
genome regulation hypotheses. To do this, we built upon the logic 
of previous studies of human TE silencing based on TE sRNAs. 
In the human genome, sRNAs were previously shown to defend 
the genome against transposition by repressing the expression of L1 
TEs. In this case, an increase in L1 generated sRNA levels led to a 
decrease in element mRNA levels via transcript cleavage. We sought 
to evaluate whether a similar inverse relationship between TE mRNA 
vs. sRNA levels could be seen across TE subfamilies genome-wide. 
On the contrary, we found that TE mRNA and sRNA levels are 
positively related (Fig. 2; Fig. S5), consistent with a possible role for 
TE-derived sRNAs in genome regulation. 

Figure 1. Scheme of the analytical pipeline and tools presented herein. 
(A) Analytical pipeline overview. (B) Example of the linear regression and 
correlation analysis used to compare mRNA vs. sRNA levels for individual 
TE subfamilies and genes across six human tissues. (C) Example of the 
distribution of the resulting correlation coefficients for all genes. 

Figure 2. mRNA vs. sRNA correlation coefficient distributions for human 
TE subfamilies and genes across six tissues. (A) Observed (blue) and 
randomized (red) correlation coefficient distributions for TE subfamilies. 
(B) Observed (blue) and randomized (red) correlation coefficient 
distributions for genes. (C) Correlation coefficient median ± standard 
error values for TE subfamily and gene observed (blue) vs. random (red) 
distributions. 

Figure 3. Median ± standard error values for the (A) correlation 
coefficient and (B) slope distributions for individual TE families (classes): 
LTR, DNA, L1 and Alu. 

The higher average correlation coefficient and slope values 
seen for the relatively young Alu family of TEs (Fig. 3) were an 
unexpected observation. If TE-derived sRNAs are being used 
primarily to degrade mRNA transcripts in order to defend the 
genome against transposition, one may expect that the youngest 
and most potentially active TE subfamilies would show the most 
pronounced negative correlation between mRNA and sRNA 
levels. Similarly, if older elements that are no longer capable of 
transposing have been domesticated to transcribe RNAs with 
functional utility for the host, then those element families should 
show higher mRNA-to-sRNA positive correlations. This was 
clearly not the case here. However, when individual Alu element 
subfamilies were considered separately, younger AluY subfamilies 
did show some evidence for genome defense by virtue of having 
negative TE mRNA-to-sRNA correlations; in fact, AluY 
subfamilies were the only ones to show such negative correlations. 
For example, the youngest AluY subfamily, AluYb with an 
estimated age of 1.9 my, has a TE mRNA-to-siRNA correlation of 
r = -0.30. Furthermore, when the relative ages of all Alu 
subfamilies are considered with respect to their TE mRNA-to- 
sRNA correlations, younger families overall show lower 
correlation values (Alu subfamily age vs. TE mRNA-to-siRNA 
correlation r = 0.43, t = 2.7, p = 0.01). Thus, for Alus there is 
evidence in favor of both genome defense and genome regulation 
hypotheses with respect to the roles of TE sRNA. These results 
are consistent with a variety of roles in genome regulation and 
organization that have been ascribed to Alu element sequences 
and transcripts. L1 subfamilies, on the other hand, do not 
show any evidence for genome defense when analyzed in a similar 
way. 
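
The t value quoted for the Alu age comparison is related to the correlation coefficient through the standard significance test for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below only illustrates that relationship. The number of Alu subfamilies n is not given in this passage, so n = 34 is an assumed value chosen because it approximately reproduces the reported t; it is not taken from the study, and the function name `t_from_r` is hypothetical.

```python
import math

def t_from_r(r, n):
    """t statistic for a Pearson correlation r computed from n paired observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# r = 0.43 is the reported correlation; n = 34 is an ASSUMED number of Alu subfamilies.
t = t_from_r(0.43, 34)
print(round(t, 1))
```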

Our results showing a positive correlation between TE mRNA 
and TE sRNA levels are consistent with two recent observations 
that also suggest that TE sRNAs should be considered with 
respect to possible roles that they may play in genome regulation. 
First of all, TEs were shown to be highly transcribed and 
dynamically regulated in the human and mouse genomes. This 
includes numerous ancient TEs that are no longer capable of 
transposition and thus would not need to be repressed by their 
host genome. Second, it has recently been shown that TE-derived 
sRNAs can directly interact with host genes to regulate their 
expression. This has been seen for TE-derived piRNAs in 
Drosophila and for TE-derived miRNAs in human. 



We would like to emphasize that the correlations observed 
here do not equal causation. Rather, the results we obtained 
point to the possibility that TE-derived sRNAs play some role in 
genome regulation. Nevertheless, we feel that the data reported 
here represent an important and worthwhile observation in light 
of the emphasis currently placed on sRNA based TE repression 
mechanisms. 

Alternative roles for TE transcript processing. TE transcript 
processing by enzymes such as Dicer is typically thought to be 
related to the repression of transposition. However, it may also 
be possible that TE transcripts need to be efficiently processed 
to mitigate some other non-transposition related threats that 
they pose to the cells. In other words, accumulation of the TE 
transcripts themselves, or simply dysregulation of the TEs, may 
be toxic to the cellular environment and cells may efficiently 
process TE transcripts to mitigate this toxicity. For example, 
accumulation of unprocessed Alu transcripts based on Dicer 
deficiency has been linked to age-related macular degeneration 
in humans. Dysregulated Alu transcription has also been related 
to the senescence of adult human stem cells, and sRNA based 
silencing of Alu transcription restores the self-renewing pheno- 
type of these cells. If organisms have evolved efficient mecha- 
nisms that process TE transcripts to mitigate their toxicity, one 
might also expect to see the kinds of positive correlations 
between TE mRNA and sRNA levels reported here across cellular 
phenotypes. 

It may also be the case that sRNA based cleavage of TE 
transcripts for the purposes of repression of transposition does not 
necessarily lead to the predicted negative correlations between 
sRNA and mRNA levels. sRNA based silencing mechanisms are 
used to repress TE expression and transposition in Arabidopsis 
thaliana gametes. TEs are expressed in the vegetative nucleus cells 
of A. thaliana pollen but not in the sperm cells that pass on 
genetic material to successive generations. Apparently, the TEs 
that are expressed in the vegetative nucleus are efficiently 
processed to yield sRNAs in accordance with the availability of 
full-length TE messages. In this case, it was proposed that TE 
activation in the vegetative nucleus may be used to provide sRNAs 
that are passed to the sperm cells to repress transposition therein. 
In other words, the repression mechanism is indirect in the sense 
that TEs from one nucleus are activated to provide sRNAs for TE 
silencing in another nucleus. This kind of mechanism could lead 
to positive correlations between TE mRNA and TE sRNA levels 
across cellular compartments with TE derived sRNAs exerting 
their repressive effects elsewhere in the organism. 

Finally, it is worth noting that the two possible roles for TE- 
derived sRNAs are not mutually exclusive. It is clearly a fact that 
TE sRNAs are used to repress transposition, but it is becoming 
increasingly evident that TEs are widely expressed and dynami- 
cally regulated to yield non-coding RNAs, which in turn can be 
efficiently processed into sRNAs that interact with host genes to 
affect their regulation. The genome-scale results reported here 
suggest that the second view warrants serious consideration and 
raise the possibility that sRNA based mechanisms may have 
initially evolved to repress transposition but now serve primarily 
in genome regulation. 

Materials and Methods 

RNA sequence data and mapping. The levels of mRNA and 
sRNA for human TEs and genes analyzed in this study are based 
on a series of previous RNA-seq studies for full-length transcripts 
and short RNAs (Table S1), and the mRNA and 
sRNA sequence read data from these studies were obtained from 
the NCBI Sequence Read Archive (SRA - http://www.ncbi.nlm.nih.gov/sra). 
mRNA and sRNA data were analyzed from six 
human tissues: brain, heart, liver, lung, kidney and skeletal muscle. 
All RNA sequences analyzed here were characterized using the 
Illumina platform under the conditions described in Table S1. 
mRNA sequences were isolated from total RNA using oligo-T 
magnetic beads, and sRNA sequences were isolated from total 
RNA using 18-35 nt size fractionation. 

Quality control analysis of RNA sequence data was done 
using the FastQC program (www.bioinformatics.bbsrc.ac.uk/projects/fastqc/), 
and only tags within the expected size range 
(19-24 nt) for miRNA or siRNA were considered for subsequent 
analysis. RNA sequence reads were mapped to the human genome 
reference sequence (NCBI36/hg18) using the program Bowtie 
with a threshold of ≤ 2 mismatches allowed. The most likely 
mapping locations for reads that mapped to more than one 
location were rescued using the Gibbs sampling strategy for 
multi-mapping tags. mRNA and sRNA sequence tags mapped 
and processed in this way were co-located with human gene and 
TE loci annotated in the UCSC Genome Browser. The 
locations of human genes were taken from the Known Genes 
track and the locations of human TEs, along with their class/ 
family/subfamily designations, were taken from the RepeatMasker 
track. 
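
The size selection step can be illustrated with a short filter over FASTQ records. This is a sketch under stated assumptions rather than the pipeline used in the study: the function name `filter_by_length` and the example records are made up, and applying the filter to already clipped reads is an assumption. Only the 19-24 nt bounds come from the text.

```python
import io

def filter_by_length(handle, min_len=19, max_len=24):
    """Yield 4-line FASTQ records whose sequence length lies in [min_len, max_len]."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if min_len <= len(seq) <= max_len:
            yield header.rstrip("\n"), seq, plus, qual

# Two made-up records: a 22 nt tag (kept) and a 30 nt tag (dropped).
lines = [
    "@tag1", "A" * 22, "+", "I" * 22,
    "@tag2", "C" * 30, "+", "I" * 30,
]
example = io.StringIO("\n".join(lines) + "\n")

kept = list(filter_by_length(example))
print([record[0] for record in kept])
```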

Statistical analysis. For each TE subfamily and each gene locus, 
tissue-specific reads per million (RPM) counts were computed for 
mRNA and sRNA. Then for each TE subfamily (n = 903) and 
each gene (n = 25,246), least squares linear regression was used to 
compare mRNA vs. sRNA levels across the six tissues, and the 
correlation coefficient and slope values were determined. A 
matched series of random correlation coefficients and slopes were 
calculated by randomly shuffling the underlying tissue-specific 
mRNA and sRNA RPM counts for each TE subfamily and each 
gene and performing the same linear regression analysis. Median 
values for the distributions of the correlation coefficient and slope 
values were compared using the Wilcoxon rank sum test. 
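
The steps in this paragraph (RPM normalization, a shuffled control, and a rank sum comparison) can be sketched in plain Python as follows. This is an illustration, not the authors' code. All function names and RPM vectors are hypothetical, a single shuffle per subfamily stands in for the matched random series, and the W convention used (rank sum of the first sample minus n(n + 1)/2) is an assumption about how the statistic was reported.

```python
import math
import random

def rpm(count, total_mapped):
    """Reads per million: raw tag count scaled by the number of mapped reads."""
    return count * 1_000_000 / total_mapped

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def shuffled_r(x, y, rng):
    """Correlation after independently shuffling the tissue order of both vectors."""
    xs, ys = list(x), list(y)
    rng.shuffle(xs)
    rng.shuffle(ys)
    return pearson(xs, ys)

def rank_sum_w(a, b):
    """Rank sum of sample a minus n_a(n_a + 1)/2; tied values get average ranks."""
    values = sorted(list(a) + list(b))
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    return sum(rank_of[v] for v in a) - len(a) * (len(a) + 1) / 2.0

# Made-up (mRNA RPM, sRNA RPM) vectors across six tissues for three subfamilies.
subfamilies = {
    "subfamA": ([120, 80, 200, 260, 310, 90], [1.1, 0.9, 1.8, 2.2, 2.9, 1.0]),
    "subfamB": ([15, 40, 22, 31, 55, 18], [0.2, 0.5, 0.2, 0.4, 0.6, 0.3]),
    "subfamC": ([300, 280, 150, 90, 410, 200], [2.0, 2.4, 1.1, 0.8, 3.0, 1.6]),
}
rng = random.Random(0)  # fixed seed so the randomized control is reproducible
observed = [pearson(m, s) for m, s in subfamilies.values()]
control = [shuffled_r(m, s, rng) for m, s in subfamilies.values()]
print("W =", rank_sum_w(observed, control))
```

A p-value would additionally require the null distribution of W (exact or via a normal approximation), which is omitted here to keep the sketch short.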